Hezi Buba
12/06/19
knitr::opts_chunk$set(error = TRUE)
library(raster)
source("enviroment_data.R")
source("enviroment_data_future.R")
rm(list=ls())
halophila_coordinates<-read.csv("results_asparagopsis/coordinates.csv")
save_path <- "results_asparagopsis/sdm/only minsst maxsst_coordinates_moved_med_asparagopsis/"
...
source("model_selection_formula.R")
...
source("predictors_selection.R")
source("sdm_model_poly.R")
source("combined_model_new.R")
The use of source() allows you to break down your project into logical segments.
read.csv() instead of RStudio's “Import Dataset” button - goes without saying.
No setwd("~Users/personal/path/which/only/collaborator/has/") anywhere.
A dedicated results subfolder.
However:
rm(list = ls()). DO NOT USE IT IN SCRIPTS! It only clears the global environment - attached packages and changed options survive - so it does not give you the fresh session you think it does. Restart R instead.
For paths, use the here package: here() searches for and returns your project root directory.
library("here")
setwd("C:/")
getwd()
[1] "C:/"
here()
[1] "C:/Users/eze36/Documents/multiscriptgoodpracticres"
library("ggplot2")
library("here")
plot <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()
dir.create(here("figures"))
ggsave(here("figures", "demofig1.png"), plot = plot)
Eventually you will either share your script with collaborators or return to it yourself, months later, in a fresh session on a possibly different machine.
Writing your script with those scenarios in mind will minimize errors caused by depending on your current R session settings. From theory to practice:
Do not assume that a package is attached or even installed!
Here are the starts of randomly selected scripts from this project:
source("ex1_globals.r")
library(ggplot2)
####
library(nlme)
library(minpack.lm)
library(AICcmodavg)
library(sp)
library(raster)
library(BIOMOD)
library(stats4)
####
For some odd reason I just couldn't install ecospat, one of the packages used in this project.
After a few frustrating days I realized ecospat is used for only one function throughout the whole script!
boyce_index <- ecospat.boyce(data_m_p_BO,presence_data, window.w = 0.2)
Luckily, GitHub exists. Searching for the package there led me to the repo, where the function was just waiting for me.
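If you hit the same wall, one workaround is to copy that single function's definition into your own project and source it. A sketch - the file name below is hypothetical, and keep the original license and attribution:
# Hypothetical file holding the one function copied from the ecospat repo:
source("ecospat_boyce.R")
# ecospat.boyce() is now available without installing the whole package:
boyce_index <- ecospat.boyce(data_m_p_BO, presence_data, window.w = 0.2)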
Don't write install.packages() in your script.
Use library() instead of require() (see the sketch after this list).
Write a library-loading script.
Use the pacman package to automatically install missing packages.
Use the packrat package to contain the packages within your project:
packrat::init("project/path")
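Why library() over require()? A quick sketch with a deliberately nonexistent package:
# require() merely warns (and returns FALSE) when a package is missing,
# so the script keeps running and fails later, somewhere confusing:
require(notarealpackage)
# library() stops immediately with an error - exactly what you want in a script:
library(notarealpackage)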
##### libraries needed:
is_inst <- function(pkg) {
  nzchar(system.file(package = pkg))
}
if (!is_inst("pacman")) install.packages("pacman")
library("pacman")
p_load(tidyverse, pbapply, doParallel)
Restart and rerun your code often.
Don't set any .Rprofile options such as options(stringsAsFactors = FALSE).
Do not save and load .RData!
Hey what about those week-long-analysis results?! ARE YOU MAD?!?
Use saveRDS() to save the results object to an .rds file, and readRDS() to load it at the start of the next logical script segment.
"The source code is real. The objects are realizations of the source code." - from the ESS manual
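A minimal sketch (the object, function, and file names are hypothetical):
# End of the week-long analysis script: save the one object you need.
fit_results <- run_week_long_analysis()
saveRDS(fit_results, here("results", "fit_results.rds"))
# Start of the next logical script segment: read it back in.
# Unlike load(), readRDS() returns the object, so you choose its name.
fit_results <- readRDS(here("results", "fit_results.rds"))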
Check out the attached script sample_script.R
As you can probably see, there is a lot of repetition in the code, especially towards the end.
To understand why we prefer writing functions over copy + paste + replace, and to learn more about how to write functions, click here.
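As a minimal sketch of the idea (the data frames and column below are hypothetical, not from sample_script.R):
# Copy + paste + replace: three near-identical lines, three chances for a typo.
summary_2017 <- mean(data_2017$biomass, na.rm = TRUE)
summary_2018 <- mean(data_2018$biomass, na.rm = TRUE)
summary_2019 <- mean(data_2019$biomass, na.rm = TRUE)
# A function: the logic lives in exactly one place, and iteration is explicit.
mean_biomass <- function(df) mean(df$biomass, na.rm = TRUE)
summaries <- lapply(list(data_2017, data_2018, data_2019), mean_biomass)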
coef_b <- matrix(nrow = 0, ncol = 5)
for (i in 1:100){
...
...
  m1Results <- coda.samples(model1, m1Pars, settings$samples * settings$thin, thin = settings$thin)
  # analyzing the data:
  coef_b <- rbind(coef_b, as.matrix(m1Results))
}
This code snippet runs slowly and eats memory, because rbind() copies the entire accumulated matrix on every iteration. Let's demonstrate using the bench package, with two functions:
f1 <- function() {
  result <- matrix(nrow = 0, ncol = 100)
  for (i in seq_len(100)) {
    output <- matrix(1:1e5, ncol = 100)
    result <- rbind(result, output)  # copies the whole accumulated matrix each time
  }
  as.data.frame(result)
}
f2 <- function() {
  list.of.result <- lapply(seq_len(100), function(i)
    as.data.frame(matrix(1:1e5, ncol = 100)))
  do.call(what = dplyr::bind_rows, args = list.of.result)
}
bench::mark(f1(),f2())
# A tibble: 2 x 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 f1() 2.18s 2.18s 0.459 2.03GB 13.8
2 f2() 317.29ms 337.76ms 2.96 230.75MB 13.3
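A third option, not benchmarked above, is to preallocate the full result and fill it in place - a sketch along the lines of f1() and f2():
f3 <- function() {
  # Preallocate once: 100 chunks of 1000 rows each, 100 columns.
  result <- matrix(NA_integer_, nrow = 100 * 1000, ncol = 100)
  for (i in seq_len(100)) {
    rows <- ((i - 1) * 1000 + 1):(i * 1000)
    result[rows, ] <- matrix(1:1e5, ncol = 100)  # fill in place, no copying
  }
  as.data.frame(result)
}
Because nothing grows inside the loop, R never has to copy the accumulated result.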
Ohana means family. Family means nobody gets left behind or forgotten - Lilo and Stitch
The apply family is a great way to iterate. Most commonly used:
lapply() - takes a list as input, returns a list.
apply() - takes an array as input (matrix or dataframe), returns either a list or an array.
apply(matrix(1:100, 10, 10), 2, mean)
[1] 5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5
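And a quick lapply() example, since only apply() is shown above:
# lapply() maps over a list and always returns a list:
lapply(list(a = 1:3, b = 4:6), mean)
# $a
# [1] 2
# $b
# [1] 5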
mysteryfunction <- function(matrix) {
  return(apply(matrix, 2, mean))
}
What does this function do?
apply_colmeans <- function(matrix) {
  return(apply(matrix, 2, mean))
}
Let's compare it with colMeans()
matrix <- matrix(runif(1e6),1000)
bench::mark(colMeans(matrix),apply_colmeans(matrix))
# A tibble: 2 x 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 colMeans(matrix) 803us 833.2us 1117. 32.3KB 0
2 apply_colmeans(matrix) 12ms 13.8ms 69.0 19.2MB 21.9
colMeans() is much faster because it is a vectorized function. If you have the need for speed, read more about vectorization here.
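The idea in miniature: a vectorized expression pushes the loop down into compiled code.
x <- runif(1e6)
# Looping in R: one interpreted iteration per element.
doubled <- numeric(length(x))
for (i in seq_along(x)) doubled[i] <- x[i] * 2
# Vectorized: the same loop runs in compiled C code - typically orders of magnitude faster.
doubled <- x * 2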
pbapply is an amazing package that displays a progress bar whenever you use one of its pb*apply() functions.
pbapply::pbapply(matrix(1:100,10,10),2,mean)
[1] 5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5
…
Well, not here. But try it on your machines!
Also… pbapply allows for parallel computing…
In my quest for speeding things up in this project I turned to parallel processing.
What is parallel processing?
Sadly, base R only makes use of one core of your CPU. So if you have 8 cores, R will use only 1/8th of your CPU's potential!
This is where parallel processing comes into play - you run your computation in parallel across multiple cores.
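You can check how many cores your machine reports with the parallel package:
# Number of logical cores available on this machine:
parallel::detectCores()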
Let's demonstrate:
bench::mark(pbapply::pblapply(1:8,function(i) Sys.sleep(10)))
# A tibble: 1 x 6
expression min median `itr/sec`
<bch:expr> <bch> <bch:> <dbl>
1 pbapply::pblapply(1:8, function(i) Sys.sleep(10)) 1.34m 1.34m 0.0124
# ... with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
As you can see, this code takes 8 × 10 = 80 seconds to run if you don't parallelize it.
Let's repeat the above example. But first, we will set up and register a cluster.
cluster <- parallel::makeCluster(8)
doParallel::registerDoParallel(cluster)
This spins up 8 worker R processes and registers them as a parallel backend.
We continue with the apply:
bench::mark(pbapply::pblapply(1:8,function(i) Sys.sleep(10),cl = cluster))
# A tibble: 1 x 6
expression min
<bch:expr> <bch>
1 pbapply::pblapply(1:8, function(i) Sys.sleep(10), cl = cluster) 10s
# ... with 4 more variables: median <bch:tm>, `itr/sec` <dbl>,
# mem_alloc <bch:byt>, `gc/sec` <dbl>
Ten seconds instead of eighty! We finish by closing the cluster with:
parallel::stopCluster(cluster)
Objects are not automatically exported to cluster workers!
a <- c(1:8)
f <- function(i) {return(i*2)}
cluster <- parallel::makeCluster(8)
doParallel::registerDoParallel(cluster)
pbapply::pblapply(1:8,function(i) f(a[i]) ,cl = cluster)
Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: could not find function "f"
parallel::stopCluster(cluster)
a <- c(1:8)
f <- function(i) {return(i*2)}
cluster <- parallel::makeCluster(8)
doParallel::registerDoParallel(cluster)
parallel::clusterExport(cluster,c("a","f"))
pbapply::pblapply(1:8,function(i) f(a[i]) ,cl = cluster)
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 6
[[4]]
[1] 8
[[5]]
[1] 10
[[6]]
[1] 12
[[7]]
[1] 14
[[8]]
[1] 16
parallel::stopCluster(cluster)
This applies to packages as well. I load the library on each worker like so:
# for example
parallel::clusterCall(cluster, function() library(tidyverse))
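An equivalent base-R option is parallel::clusterEvalQ(), which evaluates an expression on every worker:
# Attach tidyverse on each worker in the cluster:
parallel::clusterEvalQ(cluster, library(tidyverse))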
What's comfortable for you may not be as intuitive to your collaborators.
What's comfortable for you right now may not be as intuitive to future you!
Make sure to adopt good practices early on, so you can easily work on other people's code and they can easily understand and work on your code.
R is an evolving language - it is good to stay updated.
Programming is a skill - with time and practice we all get better.